Big Data Cup Starter
2024-02-05
Plan for Big Data Cup
https://www.stathletes.com/big-data-cup/
There will be 2 categories for participants:
Teams can be 1-4 participants.
Finalists will be selected* and will have the opportunity to present their findings to our panel of sport executives at our reception at Rotman on April 19th.
Prizes will be awarded to top qualifiers.
Participation in the Big Data Cup competition is open to any individual regardless of background, experience, previous analysis, or public work.
2024 Timeline and Key Dates
All are encouraged to submit a written report that will be due March 8th, 2024.
Maximum 6 pages, including figures (size limit 10GB on submission).
Submissions can be emailed to: bigdatacup@stathletes.com
with subject line: Big Data Cup 2024.
Please note that email size is limited to 25MB. To send larger submissions (up to 10GB), use Dropbox, Google Drive or another file-sharing service and include the link in your submission email.
There are two research areas
Please identify in the title which area you are focusing on:
You should know a little bit about hockey before you start. If you know soccer/football, you’re well on your way. Here’s a football-hockey translation guide I wrote in 2020.
https://www.stats-et-al.com/2020/08/soccer-to-hockey-translation-guide.html
From https://github.com/bigdatacup/Big-Data-Cup-2024, we
download BDC_2024_Womens_Data.csv and load it
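A minimal sketch of the loading step, assuming the CSV has been saved to the working directory. The helper name `load_bdc` and the definition of `bdc_shots` as the rows whose `Event` is "Shot" or "Goal" are assumptions made for illustration; the original loading code is not shown. A tiny stand-in file with made-up rows is written first so the sketch runs without the download.

```r
# Hypothetical helper: read the event file and pull out the shot attempts.
load_bdc <- function(path) {
  bdc <- read.csv(path, stringsAsFactors = FALSE)
  # Assumption: shot attempts are the events labelled "Shot" or "Goal".
  bdc_shots <- subset(bdc, Event %in% c("Shot", "Goal"))
  list(bdc = bdc, bdc_shots = bdc_shots)
}

# Stand-in data with made-up rows, only so the sketch is self-contained.
demo_path <- tempfile(fileext = ".csv")
demo_rows <- data.frame(
  Event = c("Faceoff Win", "Play", "Shot", "Goal"),
  X.Coordinate = c(100, 3, 170, 185),
  Y.Coordinate = c(42, 59, 40, 44)
)
write.csv(demo_rows, demo_path, row.names = FALSE)

demo <- load_bdc(demo_path)
head(demo$bdc)          # first rows of every event
table(demo$bdc$Event)   # event counts
nrow(demo$bdc_shots)    # number of shot attempts

# With the real file: loaded <- load_bdc("BDC_2024_Womens_Data.csv")
```

With the real file, `head()` and `table()` on the loaded data frame give a preview and event counts of the kind printed here.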
## Date Home.Team Away.Team Period Clock
## 1 2023-11-08 Women - United States Women - Canada 1 20:00
## 2 2023-11-08 Women - United States Women - Canada 1 19:57
## 3 2023-11-08 Women - United States Women - Canada 1 19:54
## 4 2023-11-08 Women - United States Women - Canada 1 19:52
## 5 2023-11-08 Women - United States Women - Canada 1 19:50
## 6 2023-11-08 Women - United States Women - Canada 1 19:50
## Home.Team.Skaters Away.Team.Skaters Home.Team.Goals Away.Team.Goals
## 1 5 5 0 0
## 2 5 5 0 0
## 3 5 5 0 0
## 4 5 5 0 0
## 5 5 5 0 0
## 6 5 5 0 0
## Team Player Event X.Coordinate
## 1 Women - Canada Marie-Philip Poulin Faceoff Win 100
## 2 Women - Canada Jocelyne Larocque Puck Recovery 50
## 3 Women - Canada Jocelyne Larocque Play 3
## 4 Women - Canada Renata Fast Play 6
## 5 Women - Canada Emma Maltais Incomplete Play 48
## 6 Women - United States Hilary Knight Takeaway 141
## Y.Coordinate Detail.1 Detail.2 Detail.3 Detail.4 Player.2
## 1 42 Backhand Taylor Heise
## 2 10
## 3 59 Indirect Renata Fast
## 4 21 Direct Emma Maltais
## 5 2 Direct Marie-Philip Poulin
## 6 72
## X.Coordinate.2 Y.Coordinate.2
## 1 NA NA
## 2 NA NA
## 3 4 35
## 4 48 2
## 5 62 28
## 6 NA NA
##
## Dump In/Out Faceoff Win Goal Incomplete Play Penalty Taken
## 591 209 20 773 37
## Play Puck Recovery Shot Takeaway Zone Entry
## 2333 2266 403 287 540
The coordinates are already adjusted for possession. For example, all the shots are between 125 and 200 feet in the x-coordinate, which suggests they are all taken from the attacking zone of whichever team is shooting.
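The range check behind this claim, and the reference lines drawn in the plots that follow, can be sketched on made-up data. The helper `add_rink_lines` and the toy data frame are hypothetical. Reading the three x-values as the blue line, goal line and end boards assumes a 200 ft rink with the attacking end on the right.

```r
# Vertical reference lines for the attacking half, in feet:
# blue line (125), goal line (189) and end boards (200).
rink_lines <- data.frame(
  x = c(125, 189, 200),
  col = c("Blue", "Red", "Black"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: add the reference lines to the current base-graphics plot.
add_rink_lines <- function(lines = rink_lines) {
  abline(v = lines$x, lwd = 3, col = lines$col)
}

# Made-up shot locations standing in for bdc_shots.
toy_shots <- data.frame(
  X.Coordinate = c(130, 155, 172, 188, 196),
  Y.Coordinate = c(20, 60, 42, 38, 70)
)

# Check that every shot lies in the attacking zone (x between 125 and 200).
in_zone <- toy_shots$X.Coordinate >= 125 & toy_shots$X.Coordinate <= 200
all(in_zone)

plot(toy_shots$X.Coordinate, toy_shots$Y.Coordinate,
     xlim = c(0, 200), ylim = c(0, 85))
add_rink_lines()
```

On the real data, `range(bdc_shots$X.Coordinate)` gives the same check in one line.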
plot(bdc_shots$X.Coordinate, bdc_shots$Y.Coordinate, xlim=c(0,200), ylim=c(0,85))
abline(v=c(125,189,200), lwd=3, col=c("Blue","Red","Black"))
Other plays don’t have this location restriction. Plays can happen everywhere.
# To do: Draw a whole rink overlay
plot(bdc$X.Coordinate, bdc$Y.Coordinate, xlim=c(0,200), ylim=c(0,85))
abline(v=c(125,189,200), lwd=3, col=c("Blue","Red","Black"))
abline(v=c(75,11,0,100), lwd=3, col=c("Blue","Red","Black","Black"))
abline(h=c(0,85), lwd=3, col="Black")
Zone Entries are recorded in this
bdc_zone = subset(bdc, Event == "Zone Entry")
plot(bdc_zone$X.Coordinate, bdc_zone$Y.Coordinate, xlim=c(0,200), ylim=c(0,85))
abline(v=c(125,189,200), lwd=3, col=c("Blue","Red","Black"))
abline(v=c(75,11,0,100), lwd=3, col=c("Blue","Red","Black","Black"))
abline(h=c(0,85), lwd=3, col="Black")
Including dump-ins
bdc_zone = subset(bdc, Event %in% c("Dump In/Out", "Zone Entry"))
plot(bdc_zone$X.Coordinate, bdc_zone$Y.Coordinate, xlim=c(0,200), ylim=c(0,85))
abline(v=c(125,189,200), lwd=3, col=c("Blue","Red","Black"))
abline(v=c(75,11,0,100), lwd=3, col=c("Blue","Red","Black","Black"))
abline(h=c(0,85), lwd=3, col="Black")
Possible sources of inspiration
Today we’re going to apply the principles of Chapter 9.6, K-means in R, of the DSCI 100 textbook, found at https://ubc-dsci.github.io/introduction-to-datascience/clustering.html#k-means-in-r to a data frame from the 2016-17 regular season of the National Hockey League.
library(ggplot2)
## # A tibble: 66,771 × 50
## season gcode refdate event period seconds etype a1 a2 a3 a4 a5
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 2.02e7 20001 5399 8 1 71 SHOT 2964 1836 1851 1743 2211
## 2 2.02e7 20001 5399 19 1 173 SHOT 2254 2930 2004 2443 2514
## 3 2.02e7 20001 5399 28 1 241 SHOT 2964 1836 1851 2443 2211
## 4 2.02e7 20001 5399 32 1 286 SHOT 2964 1836 1851 1743 2514
## 5 2.02e7 20001 5399 49 1 406 SHOT 2043 2622 1021 2443 2514
## 6 2.02e7 20001 5399 52 1 450 SHOT 2964 1836 1851 2966 2211
## 7 2.02e7 20001 5399 59 1 509 SHOT 2254 2930 2004 2443 2211
## 8 2.02e7 20001 5399 62 1 540 SHOT 2910 2912 2965 2966 2211
## 9 2.02e7 20001 5399 75 1 616 SHOT 2254 2930 2004 2211 2514
## 10 2.02e7 20001 5399 76 1 625 SHOT 2254 2930 2004 2211 2514
## # … with 66,761 more rows, and 38 more variables: a6 <dbl>, h1 <dbl>, h2 <dbl>,
## # h3 <dbl>, h4 <dbl>, h5 <dbl>, h6 <dbl>, ev.team <chr>, ev.player.1 <dbl>,
## # ev.player.2 <dbl>, ev.player.3 <dbl>, distance <dbl>, type <chr>,
## # homezone <chr>, xcoord <dbl>, ycoord <dbl>, awayteam <chr>, hometeam <chr>,
## # home.score <dbl>, away.score <dbl>, event.length <dbl>, away.G <dbl>,
## # home.G <dbl>, home.skaters <dbl>, away.skaters <dbl>,
## # adjusted.distance <lgl>, shot.prob.distance <lgl>, …
This data frame consists of every officially recorded shot in 1226 of the 1230 games of the regular season. (The play-by-play details of the remaining 4 games are not available at the source, nhl.com.) Scoring officials recorded 66,771 shots during these games. For most of these shots, we know, among other variables:
There are many, many questions we can answer using this data, including
For now we are going to narrow our focus to two questions that K-means clustering can answer: Are there archetypal shot locations, and if so, where are they? Taking a scatterplot of the raw data, we see some patterns and challenges:
gr1 <- ggplot(df_shots, aes(x = xcoord, y = ycoord)) +
geom_point() +
xlab("x from center ice (feet)") +
ylab("y from center ice (feet)")
plot(gr1)
First, there are far too many points to meaningfully visualize many details and trends. Over the course of a season, shots come from almost everywhere, but there is a dramatic drop-off in density between \(x=-25\) and \(x=25\). The following diagram of NHL rink dimensions can explain why: players almost always take shots between the goal and the blue line.
We can simplify our analysis by examining the absolute value of the x-coordinates instead of the x-coordinates themselves. We also flip the y-coordinate so that we’re doing a rotation around center ice and things like left- and right-wing mean the same thing regardless of the side of the ice.
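A minimal sketch of that transformation on made-up shots. It assumes "flip the y-coordinate" means negating y only for shots taken at the negative-x end, which together with the absolute value of x amounts to a 180-degree rotation about center ice. The data frame `toy` is hypothetical.

```r
# Made-up shots on both halves of the rink, in feet from center ice.
toy <- data.frame(
  xcoord = c(-80, -60, 70, 35),
  ycoord = c( 10, -20,  5, -25)
)

# Shots with negative x were taken at the other end of the rink.
other_end <- toy$xcoord < 0

# Rotate those shots 180 degrees about center ice: negate both coordinates.
toy$ycoord[other_end] <- -toy$ycoord[other_end]
toy$xcoord <- abs(toy$xcoord)

toy
```

After this step every shot has a non-negative x, and a shot from a given wing has the same sign of y whichever end it was taken at.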
We can also use a graphical method called contour plotting to visualize the many overlapping data points and better see locations of high and low shot density. In the following graph, the brightly coloured regions are the locations of the highest shot density.
gr2 <- ggplot(df_shots, aes(x = xcoord, y = ycoord)) +
geom_density_2d_filled() +
xlab("x from center ice (feet)") +
ylab("y from center ice (feet)")
plot(gr2)
There are either three or five locations of relatively high shot density. The dominant location is immediately in front of the net at \(x=89, y=0\). The secondary locations are at the back corners, near \(x=35, y= \pm 25\). The tertiary locations are between the back corners and the net, near \(x=60, y= \pm 15\); these locations are called ‘the slots’.
Because the values already represent locations in physical space, we will not standardize.
Let’s try k-means clustering using different values for k and compare the within-cluster sum of squared distances, or WSSD. Recall from Section 9.4 at https://ubc-dsci.github.io/introduction-to-datascience/clustering.html#k-means that we want both a small WSSD and a small number of clusters, which we will find at the ‘elbow’ in the following scree plot.
Unfortunately, there is no well-defined elbow here, but both \(k=3\) and \(k=5\) are good candidates. We will use \(k=5\).
# Total within-cluster sum of squares for k = 2, ..., 10
wssd <- rep(NA, 9)
for (k in 2:10) {
  shot_clust <- kmeans(df_shotloc, centers = k)
  wssd[k - 1] <- shot_clust$tot.withinss
}
centers <- 2:10
dat <- data.frame(centers, wssd)
# Scree plot: WSSD against the number of clusters
gr3 <- ggplot(dat, aes(x = centers, y = wssd)) +
  geom_line() +
  geom_point() +
  xlab("number of clusters") +
  ylab("WSSD")
plot(gr3)
Do the centers of the clusters align with our previous visuals-based intuition? We can find the centers of each of the clusters from the object returned by the function kmeans. We can also find the number of shots that fit into each cluster.
Visually, we found one large cluster near \(x=80, y=0\), two medium-sized clusters at \(x=35, y= \pm 25\), and two small clusters at \(x=60, y= \pm 15\).
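A minimal sketch of both lookups on synthetic data. The data frame here is a made-up stand-in for `df_shotloc`, and the `set.seed` and `nstart` choices are assumptions rather than the settings used for the printed results. The `centers` and `size` components are standard parts of the object returned by `kmeans`.

```r
set.seed(2017)  # k-means starts from random centers; a seed makes the run repeatable

# Synthetic stand-in for df_shotloc: points scattered around five made-up locations.
true_centers <- data.frame(
  xcoord = c(80, 35,  35, 60,  60),
  ycoord = c( 0, 25, -25, 15, -15)
)
idx <- rep(1:5, times = c(200, 100, 100, 100, 100))
df_shotloc <- data.frame(
  xcoord = true_centers$xcoord[idx] + rnorm(length(idx), sd = 5),
  ycoord = true_centers$ycoord[idx] + rnorm(length(idx), sd = 5)
)

# nstart = 10 tries several random starts and keeps the best fit.
shot_clust_5 <- kmeans(df_shotloc, centers = 5, nstart = 10)

shot_clust_5$centers  # one row per cluster: mean x and mean y
shot_clust_5$size     # number of points assigned to each cluster
```

The cluster numbers are arbitrary labels, so the same five groups can come back in a different order on another run.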
## xcoord ycoord
## 1 76.54126 -0.6181961
## 2 38.03774 -18.6196618
## 3 65.72413 22.4948896
## 4 65.78030 -23.5139906
## 5 38.15829 21.0938875
Let’s add the centers as bright numbers to the contour plot to back this up.
shot_centers <- as.data.frame(shot_clust_5$centers)
gr4 <- gr2 + geom_point(data=shot_centers, aes(x=xcoord, y=ycoord),
inherit.aes = FALSE, col="Red", size = 7, pch=as.character(1:5))
plot(gr4)
The centers of the clusters agree with our visual inspection. Using the cluster numbers in the centers printed above, cluster 1 represents the shots near the goal net. Clusters 5 and 2 represent the shots by the top and bottom back corners, respectively. Clusters 3 and 4 represent the top and bottom slots, respectively. (The labels that kmeans assigns are arbitrary and can change from run to run.)
What are the relative sizes of these clusters?
## [1] 21642 11474 11643 11329 10683
The cluster of shots near the goal is roughly twice as large as each of the other four clusters. From the contour plot, it may seem as though the number of shots near the net should be much larger, but the other four clusters would appear less prominently if they were more diffuse. That is, if corner (2 and 5) and slot (3 and 4) shots came from more varied locations than net (1) shots, they might appear less bright in a contour plot.
We can find relative diffusion by looking at the root mean-squared distance, which is \(RMSD_i = \sqrt{WSSD_i / n_i}\), where \(n_i\) is the size of each cluster, \(WSSD_i\) is the sum of squared distance within each cluster, and \(i = 1, \ldots , k\).
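The formula maps directly onto two components of a `kmeans` fit: `withinss` holds \(WSSD_i\) and `size` holds \(n_i\). The sketch below uses made-up points, one tight group and one spread-out group, rather than the shot data. The names `pts` and `fit` are hypothetical.

```r
set.seed(1)

# Synthetic stand-in: a tight group and a more spread-out group.
pts <- data.frame(
  xcoord = c(rnorm(100, mean = 80, sd = 3), rnorm(100, mean = 35, sd = 9)),
  ycoord = c(rnorm(100, mean = 0,  sd = 3), rnorm(100, mean = 25, sd = 9))
)
fit <- kmeans(pts, centers = 2, nstart = 10)

# withinss is WSSD_i and size is n_i, one entry per cluster.
rmsd <- sqrt(fit$withinss / fit$size)
rmsd
```

For the shot clusters, the same expression would be `sqrt(shot_clust_5$withinss / shot_clust_5$size)`.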
## [1] 10.17421 14.98870 12.67287 12.35079 13.93232
The cluster of net shots is the least diffuse, \(RMSD_1 \sim 10 \mathrm{ft}\), followed by the slot shots, \(RMSD_{\{3,4\}} \sim 12.5 \mathrm{ft}\), followed by the corner shots, \(RMSD_{\{2,5\}} \sim 14.5 \mathrm{ft}\).
Suggested Exercises:
1. The variable ev.team contains a three-character code of the team that took the shot in question. Use the filter function and the code in this case study to explore the shooting pattern of your favourite team (sorry, no Golden Knights or Kraken in 2016-17). Does your team follow the same shooting patterns as the league as a whole?
2. Find the distance from each cluster mean to the center of the net, at \(x=89, y=0\).
3. Repeat questions 1 and 2 for the case where there are 3 clusters rather than 5. Briefly describe the 3 clusters that emerge.
## [1] "TOR" "OTT" "STL" "CHI" "CGY" "EDM" "L.A" "S.J" "MTL" "BUF" "NYR" "NYI"
## [13] "WSH" "PIT" "BOS" "CBJ" "DET" "T.B" "MIN" "CAR" "WPG" "ANA" "DAL" "NSH"
## [25] "PHI" "N.J" "FLA" "COL" "ARI" "VAN"
Hockey rink dimensions chart from https://www.sportsfeelgoodstories.com/hockey-rink-dimensions-size-diagram/
Now let’s try contour plotting to visualize the Big Data Cup data points. As before, the brightly coloured regions are the locations of the highest shot density.
gr_b1 <- ggplot(bdc_shots, aes(x = X.Coordinate, y = Y.Coordinate)) +
  geom_density_2d_filled() +
  xlab("x from defensive end (feet)") +
  ylab("y from camera side (feet)")
plot(gr_b1)
We can also find any cluster centers with kmeans.
Starting with k=5.
For the men, we found one large cluster near \(x=180, y=42.5\), two medium-sized clusters at \(x=135, y= 42.5 \pm 25\), and two small clusters at \(x=160, y= 42.5 \pm 15\). (After adjusting for the BDC coordinate system)
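The adjustment can be sketched as a shift of origin, assuming the two systems differ only in where zero sits: center ice for the NHL data, and the corner of a 200 ft by 85 ft rink for the BDC data. The helper `nhl_to_bdc` is hypothetical. The input values are the NHL cluster centers printed earlier, rounded to one decimal.

```r
# NHL cluster centers in feet from center ice (rounded from the earlier output).
nhl_centers <- data.frame(
  xcoord = c(76.5,  38.0, 65.7,  65.8, 38.2),
  ycoord = c(-0.6, -18.6, 22.5, -23.5, 21.1)
)

# Hypothetical helper: move the origin from center ice to the rink corner.
nhl_to_bdc <- function(centers, rink_length = 200, rink_width = 85) {
  data.frame(
    X.Coordinate = centers$xcoord + rink_length / 2,
    Y.Coordinate = centers$ycoord + rink_width / 2
  )
}

bdc_equiv <- nhl_to_bdc(nhl_centers)
bdc_equiv
```

The shifted centers can then be compared with the k-means centers fitted to the Big Data Cup shots.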
set.seed(12345)
bdc_shotloc = subset(bdc_shots, select = c(X.Coordinate, Y.Coordinate))
shot_clust_5w <- kmeans(bdc_shotloc, centers = 5)
shot_clust_5w$centers
## X.Coordinate Y.Coordinate
## 1 142.6986 31.15068
## 2 178.7368 41.83333
## 3 167.9231 61.94872
## 4 167.0132 18.09211
## 5 141.0000 66.95161
Try again for k=3. This time the corners disappear and
the ‘slot’ shots remain.
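A sketch of the three-cluster fit, mirroring the k=5 call. The object name `shot_clust_3w`, the seed and `nstart` are assumptions, and the data frame below is a made-up stand-in for `bdc_shotloc` so the snippet runs on its own.

```r
set.seed(12345)

# Made-up stand-in for bdc_shotloc: three loose groups in the attacking zone.
bdc_shotloc <- data.frame(
  X.Coordinate = c(rnorm(40, 150, 6), rnorm(60, 178, 5), rnorm(40, 149, 6)),
  Y.Coordinate = c(rnorm(40,  24, 6), rnorm(60,  43, 5), rnorm(40,  65, 6))
)

# Same call as for k = 5, with three centers; nstart guards against a poor random start.
shot_clust_3w <- kmeans(bdc_shotloc, centers = 3, nstart = 10)
shot_clust_3w$centers
```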
## X.Coordinate Y.Coordinate
## 1 151.1250 23.59167
## 2 177.6215 43.33898
## 3 148.6038 65.08491
Set up all the events on a ggplot
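library(stringr)  # for str_extract()
# Minute of the game (1 to 60) from the period and the minutes left on the clock.
# 20 - minutes_left maps a clock of 19:xx to minute 1 and 0:xx to minute 20.
bdc$minute = 20*(bdc$Period - 1) + 20 - as.numeric(str_extract(bdc$Clock, "^[0-9]+"))
bdc$minute = pmax(1, bdc$minute)   # a clock of exactly 20:00 would otherwise give 0
bdc$minute = pmin(60, bdc$minute)  # anything past regulation is folded into minute 60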
gr_all <- ggplot(bdc, aes(x = X.Coordinate, y = Y.Coordinate)) +
geom_point() +
xlab("x from defensive end (feet)") +
ylab("y from camera side (feet)")
plot(gr_all)
Use gganimate, thanks to https://www.datanovia.com/en/blog/gganimate-how-to-create-plots-with-beautiful-animation-in-r/
library(gganimate)
gr_anim <- gr_all + transition_time(minute) +
labs(title = "Minute: {frame_time}")
#animate(gr_anim, fps=5)
gr_anim
Let’s try this with a spatial correction. Instead of feet from the defensive end, let’s use feet from the visiting team’s end.
idx_team_is_home = which(bdc$Team == bdc$Home.Team)
bdc$X.Coordinate.adj = bdc$X.Coordinate
bdc$X.Coordinate.adj[idx_team_is_home] = 200 - bdc$X.Coordinate[idx_team_is_home]
bdc$Y.Coordinate.adj = bdc$Y.Coordinate
gr_all2 <- ggplot(bdc, aes(x = X.Coordinate.adj, y = Y.Coordinate.adj)) +
  geom_point() +
  xlab("x from visiting team's end (feet)") +
  ylab("y from camera side (feet)")
plot(gr_all2)